This is the title

Tony Liang

University of British Columbia

December 31, 2023

About myself

  • PhD student in Bioinformatics under Dr. Amrit Singh supervision

  • BSc. in Math + minor in Data Science

  • What does “bioinformatician” do?

    • Assist researchers like you to better understand what you’re data means
    • We’re just coders that know little bit more bio
  • Currently working in creating pipeline, tools, models analyzing biological data in an automated-fashion

    • Focused on machine learning & AI
    • Reproducible workflows

Single Cell RNA-seq

What is single cell RNA sequencing?

Loading data

The data is from a study (Kang et al. 2018) and publicly avaliable through R’s ExperimentHub function

eh <- ExperimentHub() #  Initialize the hub as some list object
sce <- eh[["EH2259"]] # We could then extract the match entry by taking this entry from out hub
# Then print it
sce
class: SingleCellExperiment 
dim: 35635 29065 
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
  • After loaded data should inspect basic information
  • What are rows? column? size of data?
  • Rows = genes
  • Columns = cells

Preprocessing of data before analysis

The data retrieved is rawest form, not all of it is useful.

Caveats:

  • Undetected genes
  • Cells with very few or many detected genes
  • Lowly expressed genes
  • unnormalized expression values

This is usually the quality control (QC) step. This could potentially be another tutorial, so not deeply covered today.

We only perfom simple actions

Remove undetected genes

sce <- sce[rowSums(counts(sce) > 0) > 0, ]
sce
class: SingleCellExperiment 
dim: 18890 29065 
metadata(0):
assays(1): counts
rownames(18890): RP11-34P13.8 AL627309.1 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
  • Reduced from \(35635\) genes to \(18890\), nearly 16 genes that were not detected in any cells

Remove cells with few or many detected genes

# We need additional support to compute per cell quality control metrics from scater package
qc <- scater::perCellQCMetrics(sce)
# remove cells with few or many detected genes
ol <- scater::isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
# Then we remove those are consider as outlier
sce <- sce[, !ol] # This means retain column that are not ol
dim(sce)
[1] 18890 26820
  • Now we changed 29065 cells to 26820 cells, where these cells are either overly abundant or too few

Remove lowly expressed genes

# Similar to early, see pattern now with rowSums
# But we want to at least have 10 cells that high expression value, this threshold could change
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
sce
class: SingleCellExperiment 
dim: 7118 26820 
metadata(0):
assays(1): counts
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
  • Interesting now, we only retained 7118 genes from original 35K and 26820 cells from original 29K.
  • This would be our final “filtered data”, but not entirely ready for analysis
    • Need to normalize these expression values

Normalize expression values

Calculate a log2-transformed normalized expression values:

  • dividing each count by its size factor
  • adding pseudo count of 1
  • log transforming
# compute sum-factors & normalize
sce <- scater::computeLibraryFactors(sce)
sce <- scater::logNormCounts(sce)
sce
class: SingleCellExperiment 
dim: 7118 26820 
metadata(0):
assays(2): counts logcounts
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(6): ind stim ... multiplets sizeFactor
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

This is our finalized data that have gone through series of QC steps. Now, we move to muscat

MUSCAT

multi-sample multi-group scRNA-seq analysis tools (Crowell et al. 2020)

It expects a SCE and requires cell metadata columns to have:

  • sample_id : sample identifier i.e. Nautilus_trt_3
  • cluster id: subpopulation (cluster assignment) i.e. T cells, monocytes
  • group id: experimental group/condition i.e. control/treatment, healthy/diseased
sce$id <- paste0(sce$stim, sce$ind)
sce <- muscat::prepSCE(sce, 
    kid = "cell", # subpopulation assignments
    gid = "stim",  # group IDs (ctrl/stim)
    sid = "id",   # sample IDs (ctrl/stim.1234)
    drop = TRUE)  # drop all other colData columns
sce
class: SingleCellExperiment 
dim: 7118 26820 
metadata(1): experiment_info
assays(2): counts logcounts
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(3): cluster_id sample_id group_id
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

Reference

Crowell, Helena L, Charlotte Soneson, Pierre-Luc Germain, Daniela Calini, Ludovic Collin, Catarina Raposo, Dheeraj Malhotra, and Mark D Robinson. 2020. “Muscat Detects Subpopulation-Specific State Transitions from Multi-Sample Multi-Condition Single-Cell Transcriptomics Data.” Nature Communications 11 (1): 6077.
Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94.